Conversation

jpeeler commented Dec 11, 2019

Description of the change:
This adds an additional function to the syncCatalogSource sync chain to ensure the gRPC readiness probes report ready before attempting any gRPC operations.

Motivation for the change:
The idea is to reduce transient errors as reported on the catalog source status. Unfortunately, I've only been able to reliably improve catalog sources of type grpc, not the ones backed by configmaps (which is what the reported bug is specifically about).

There's also some work done to make debugging easier in the future.
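
For a rough picture of what that sync-chain step looks like, here is a minimal sketch assembled from the snippets quoted in the review below; the lister accessor is an assumption, while reconciler.CatalogSourceLabelKey and the ContainersReady check appear verbatim in the diff:

// Minimal sketch, not the exact PR code: gate the remainder of the
// catalog source sync on the backing registry pod being ready.
func (o *Operator) checkBackingPodStatus(logger *logrus.Entry, in *v1alpha1.CatalogSource) (out *v1alpha1.CatalogSource, continueSync bool, syncError error) {
	out = in.DeepCopy()

	// Find the registry pod(s) labeled for this catalog source.
	selector := labels.SelectorFromSet(map[string]string{reconciler.CatalogSourceLabelKey: in.GetName()})
	pods, err := o.lister.CoreV1().PodLister().Pods(in.GetNamespace()).List(selector) // assumed lister accessor
	if err != nil {
		syncError = err
		return
	}

	if len(pods) > 0 {
		for _, cond := range pods[0].Status.Conditions {
			if cond.Type == corev1.ContainersReady && cond.Status == corev1.ConditionTrue {
				continueSync = true
				return
			}
		}
	}

	// Not ready yet: stop the sync chain here and rely on a requeue.
	logger.Info("backing pod not yet ready")
	return
}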

Reviewer Checklist

  • Implementation matches the proposed design, or proposal is updated to match implementation
  • Sufficient unit test coverage
  • Sufficient end-to-end test coverage
  • Docs updated or added to /docs
  • Commit messages sensible and descriptive

@openshift-ci-robot (Collaborator)

@jpeeler: This pull request references Bugzilla bug 1768819, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

bug 1768819: reduce GRPC transient failures

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-ci-robot added the bugzilla/valid-bug label (indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting) on Dec 11, 2019

@openshift-ci-robot (Collaborator)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jpeeler

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci-robot added the approved (PR has been approved by an approver from all required OWNERS files) and size/XL (changes 500-999 lines, ignoring generated files) labels on Dec 11, 2019

@awgreene (Member)

/retest


jpeeler commented Dec 11, 2019

/retest


jpeeler commented Dec 11, 2019

/retest

openshift-ci-robot added the size/XXL label and removed size/XL on Dec 11, 2019
openshift-ci-robot added the size/XL label and removed size/XXL on Dec 11, 2019
}

if len(pods) > 0 {
	for _, cond := range pods[0].Status.Conditions {

Contributor:

Here we are getting an array of pods but only checking the condition of the pod at index 0, so the first one. Can we always expect to get one pod in the pods list, or if the first one is ready, must the others be as well? If we only look at the one, can we use a Get instead of a List when making the request on L635?

jpeeler (Author):

Hrm, good point. I'm pretty sure the mapping is always 1:1...

jpeeler (Author):

Turns out that Get can only be queried by name, so I'm sticking with List here.
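
To illustrate the trade-off in client-go terms, here is a hypothetical helper; the label key is the one used in the diff, and the pre-context List signature matches client-go of this era:

// Get requires the pod's exact name, which isn't known here; the backing
// pod is only discoverable via the catalog source label, hence List.
func backingPods(client kubernetes.Interface, namespace, sourceName string) (*corev1.PodList, error) {
	selector := labels.SelectorFromSet(map[string]string{
		reconciler.CatalogSourceLabelKey: sourceName,
	})
	return client.CoreV1().Pods(namespace).List(metav1.ListOptions{LabelSelector: selector.String()})
}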

return
}

logger.Infof("backing pod not yet ready")

Contributor:

maybe we could add "pod status %s", pods[0].Status.Message to this log message?

jpeeler (Author):

Maybe a good compromise is to indicate the pod name here? The message seems to be blank in my testing, and printing the whole status block would be too much.
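
The compromise described could be as small as this sketch:

// Sketch: name the pod without dumping its whole status block.
logger.WithField("pod", pods[0].GetName()).Info("backing pod not yet ready")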


exdx commented Dec 11, 2019

Interesting how this only works for image-based catalog sources but not those based off config maps. The only real difference for the configmap-based ones is that we have to seed the DB manually from the configmap before serving gRPC connections. I wonder if the gRPC health probe is trying to query the pod before that action is complete, and after a series of failures it starts to back off and act funny. I know we were looking at something like this in the BZ. Maybe we can test the configmap case with a delay in the readiness probe (until after the gRPC connection is ready) and see if that improves behavior?
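
A sketch of that experiment, expressed as the registry pod's readiness probe; the grpc_health_probe exec command and the delay value are assumptions to tune, not the PR's code:

// Hypothetical tweak for configmap-backed sources: delay the first
// readiness check until the registry DB has likely been seeded.
pod.Spec.Containers[0].ReadinessProbe = &corev1.Probe{
	Handler: corev1.Handler{ // corev1.Handler, per the k8s API of this era
		Exec: &corev1.ExecAction{
			Command: []string{"grpc_health_probe", "-addr=:50051"}, // assumed probe command
		},
	},
	InitialDelaySeconds: 15, // assumed value; tune to DB seeding time
	TimeoutSeconds:      5,
}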

@njhale (Member) left a comment:

Thanks for doing this @jpeeler! Looking nice.

For now, I just have some thoughts on implementation:

return
}

func (o *Operator) waitForBackingPod(logger *logrus.Entry, in *v1alpha1.CatalogSource) (out *v1alpha1.CatalogSource, continueSync bool, syncError error) {

Member:

The name makes me think that this method is blocking, when it's not.

jpeeler (Author):

I changed it to checkBackingPodStatus.

chain := []CatalogSourceSyncFunc{
o.syncConfigMap,
o.syncRegistryServer,
o.waitForBackingPod,

Member:

Won't this also be executed for "address" type CatalogSources (i.e. no pods to check readiness for)?

jpeeler (Author):

I honestly totally forgot about catalog sources of that type. I ended up adding a simple early return for that scenario.
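
That early return might look roughly like this; the exact condition is an assumption, based on address-type sources pointing Spec.Address at an existing gRPC endpoint with no OLM-managed pod:

// Sketch: nothing to check for address-type sources.
if in.Spec.SourceType == v1alpha1.SourceTypeGrpc && in.Spec.Image == "" && in.Spec.Address != "" {
	continueSync = true
	return
}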

return
}

func (o *Operator) waitForBackingPod(logger *logrus.Entry, in *v1alpha1.CatalogSource) (out *v1alpha1.CatalogSource, continueSync bool, syncError error) {

Member:

Can this be a member of the registry reconciler?

jpeeler (Author):

It could be, but I'm not sure it should be. Given how the code is structured, with higher-level operations (which I consider pod checking to be) living in the main operator code, it seems best to me where it is. There is a comment to add memoization for ReconcileForSource which, if done later on, would make the requested refactoring a bit less invasive.

Comment on lines 643 to 677
for _, cond := range pods[0].Status.Conditions {
	if cond.Type == corev1.ContainersReady && cond.Status == corev1.ConditionTrue {
		continueSync = true
		return
	}
}

Member:

Can this be a member of the pod decorator used by the registry reconciler?

jpeeler (Author):

To me, this might be more appropriate if the shared code (in this case pod status checking) were not shared by the majority of catalog source types (configmap, grpc). The recently added if statement to ignore grpc address types isn't too ugly; do you agree?


crc := newCRClient(t)
selector := labels.SelectorFromSet(map[string]string{reconciler.CatalogSourceLabelKey: sourceName})
catSrcWatcher, err := crc.OperatorsV1alpha1().CatalogSources(testNamespace).Watch(metav1.ListOptions{LabelSelector: selector.String()})

Member:

Nice! You can also use a field selector on the name instead of using a label.

jpeeler (Author):

Since awaitPods also uses a label selector, I'll stick with that for consistency.
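
For reference, the field-selector variant suggested above would look roughly like:

// Watch a single CatalogSource by name via a field selector instead of a label.
fieldSelector := fields.OneTermEqualSelector("metadata.name", sourceName).String()
catSrcWatcher, err := crc.OperatorsV1alpha1().CatalogSources(testNamespace).Watch(metav1.ListOptions{FieldSelector: fieldSelector})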

Comment on lines +807 to +1025
for {
	select {
	case <-ctx.Done():
		return
	case evt, ok := <-catSrcWatcher.ResultChan():
		if !ok {
			errExit <- errors.New("watch channel closed unexpectedly")
			return
		}
		if evt.Type == watch.Modified {
			catSrc, ok := evt.Object.(*v1alpha1.CatalogSource)
			if !ok {
				errExit <- errors.New("watch returned unexpected object type")
				return
			}
			t.Logf("connectionState=%v", catSrc.Status.GRPCConnectionState)
			if catSrc.Status.GRPCConnectionState != nil {
				if catSrc.Status.GRPCConnectionState.LastObservedState == connectivity.Ready.String() {
					done <- struct{}{}
				}
				require.NotEqual(t, connectivity.TransientFailure.String(), catSrc.Status.GRPCConnectionState.LastObservedState)
			}
		}
	}
}

Member:

Since the watch was started earlier, I think this can be merged with the for {...} at the bottom of this test.

jpeeler (Author):

I don't want to miss any state changes, so I really want to start monitoring the watch before the catalog source is created. Since the result chan is unbuffered, I think that's the only safe way to do it, right?
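
The ordering being defended, sketched; consumeEvents is a hypothetical stand-in for the goroutine running the loop above, and catSrc is the CatalogSource under test:

// Open the watch and start draining its unbuffered result channel
// before creating the CatalogSource, so no state transitions are missed.
catSrcWatcher, err := crc.OperatorsV1alpha1().CatalogSources(testNamespace).Watch(metav1.ListOptions{LabelSelector: selector.String()})
require.NoError(t, err)
defer catSrcWatcher.Stop()

go consumeEvents(ctx, catSrcWatcher.ResultChan()) // hypothetical consumer, started first

_, err = crc.OperatorsV1alpha1().CatalogSources(testNamespace).Create(catSrc) // then create
require.NoError(t, err)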

}
op.sources = grpc.NewSourceStore(logger, 10*time.Second, 10*time.Minute, op.syncSourceState)
-	op.reconciler = reconciler.NewRegistryReconcilerFactory(lister, opClient, configmapRegistryImage, op.now)
+	op.reconciler = reconciler.NewRegistryReconcilerFactory(lister, opClient, configmapRegistryImage, op.now, op.logger.IsLevelEnabled(logrus.DebugLevel))

Member:

I would lean towards just passing in a logrus.Entry instance instead of a debug boolean

jpeeler (Author):

Is the reason so that you can have access to the operator logger in places where that isn't currently possible?

jpeeler (Author):

I've done so now - at least to the level of generating the pod spec.
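
Roughly the shape of that change, with the parameter position assumed:

// Pass the logger itself so the level check (and any future logging)
// happens where the registry pod spec is generated.
op.reconciler = reconciler.NewRegistryReconcilerFactory(lister, opClient, configmapRegistryImage, op.now, op.logger)

// ...and inside the reconciler, a sketch of the debug decision:
if logger.IsLevelEnabled(logrus.DebugLevel) {
	// append extra gRPC logging configuration to the pod spec
}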

openshift-ci-robot added the needs-rebase and size/XXL labels and removed size/XL on Dec 12, 2019
openshift-ci-robot added the size/L label and removed needs-rebase and size/XXL on Dec 12, 2019

jpeeler commented Dec 13, 2019

/retest

}
return
}
if err := o.catsrcQueueSet.Requeue(state.Key.Namespace, state.Key.Name); err != nil {

@ecordell (Member), Dec 13, 2019:

isn't this just requeuing after every state change? is that desired?

jpeeler (Author):

I actually do that so the status stays in sync with what the grpc connection is reporting. This only adds additional requeues when the connection is reported as ready, as requeues were already being done otherwise. I could refactor, since a lot of the above is similar, if you agree.

jpeeler (Author):

I went ahead and refactored it.
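
A hedged sketch of how that refactor could collapse the per-state duplication into one trailing requeue; the type and field names follow the snippets in this thread, and the signature is an assumption:

// Sketch: update status per observed connectivity state, then requeue
// once so the CatalogSource status keeps tracking the gRPC connection.
func (o *Operator) syncSourceState(state grpc.SourceState) { // assumed signature
	switch state.State {
	case connectivity.Ready:
		// record a healthy connection on the status
	default:
		// record the observed (possibly transient) state
	}
	if err := o.catsrcQueueSet.Requeue(state.Key.Namespace, state.Key.Name); err != nil {
		o.logger.WithError(err).Warn("failed to requeue catalog source")
	}
}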


out = in.DeepCopy()

selector := map[string]string{reconciler.CatalogSourceLabelKey: in.GetName()}

Member:

shouldn't this use the pod decorator to get this selector?

jpeeler (Author):

I didn't see a way to have access to the decorator at this level, so I refactored to put some commonality into the reconciler package and use that instead.

jpeeler force-pushed the grpc-transient branch 2 times, most recently from aa7152e to b4e2c24, on December 13, 2019 20:26
openshift-ci-robot added the size/XL label and removed size/L on Dec 13, 2019
openshift-ci-robot added the size/L label and removed size/XL on Dec 13, 2019

jpeeler commented Dec 16, 2019

/test unit

Jeff Peeler added 4 commits January 24, 2020 14:22
This is an attempt to eradicate all the transient failures that GRPC
connections occasionally report. These "failures" usually resolve after
about 20 seconds or so, but seeing the word failure is really unsettling
to users.

The fix here is to confirm that a catalog source pod readiness check has
been completed before attempting to set up a GRPC connection.

The flags weren't being parsed correctly and were causing crashes, so just
remove this unnecessary code.

Also adds some labeling on some test pods, but the new code specifically
ignores them because they are grpc address pods.

Allow the launched catalog pods to be aware of whether or not the log
level is set to debug and add additional GRPC logging if so. In the
future, other debugging can be added without the somewhat large amount
of changes seen here.

jpeeler commented Jan 25, 2020

/retest


jpeeler commented Jan 26, 2020

/retest


jpeeler commented Jan 31, 2020

/retest
If this doesn't pass in a week or so, I'll probably just close it. But if it does pass, I think it should be merged!


jpeeler commented Feb 1, 2020

/retest

ecordell changed the title from "bug 1768819: reduce GRPC transient failures" to "reduce GRPC transient failures" on Feb 4, 2020

@openshift-ci-robot (Collaborator)

@jpeeler: No Bugzilla bug is referenced in the title of this pull request.
To reference a bug, add 'Bug XXX:' to the title of this pull request and request another bug refresh with /bugzilla refresh.

In response to this:

reduce GRPC transient failures

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-ci-robot removed the bugzilla/valid-bug label on Feb 4, 2020
jpeeler closed this on Feb 29, 2020
exdx mentioned this pull request on Oct 14, 2021